Mapping custom instructions for the Toshiba media embedded processor (MeP)

By Hammad Hamid, Celoxica, Courtesy of Programmable Logic DesignLine
45: 21 2005 (11:40 AM)
URL: http://www.embedded.com/showArticle.jhtml?articleID=175007551

Although processor-to-hardware partitioning can be successfully resolved by a combination of designer experience, precedent, tools such as profilers and data-transfer analyzers, and a degree of patience and understanding, no engineer underestimates this task. Toshiba's MeP (media embedded processor) is a case in point.

Developed by Toshiba, the MeP is a programmable platform for creating a system-on-chip (SoC) targeted at applications that require digital media processing functionality, such as video and audio. The multiple standards that apply to digital media are constantly evolving; to be competitive in this dynamic environment, complicated functions need to be implemented in a short space of time and on a platform that can efficiently reuse intellectual property (IP). As an answer to this, the MeP is provided to users as soft IP. The MeP IP is divided into the following categories:

Core IP: The processor core section that is central to MeP.

Extension unit IP: Units that realize high-performance, sophisticated MeP modules by connecting to the extension interfaces of the MeP core.

Peripheral IP: Components such as the DMA controller and bus interfaces that form a MeP module or MeP SoC.

Once the partitioning has been completed and verified as accurate, the next challenge is to map the partitioned design onto a processor and custom hardware architecture that consists of fixed buses. In a scenario based on designing with the MeP in its Developer's Kit, the custom hardware design is wrapped with logic compatible with the chosen bus. The kit provides three main types of bus: the control bus, the DSP instruction bus, and the local bus. These buses differ greatly in their protocols, so user logic is affected by the choice of bus it is mapped onto. The size, depth, and flow of the logic mapped onto these buses will affect the performance of the entire system during transactions.

One common convention used by designers is to map the logic to fit the bus protocol, as with the DSP instruction bus, where designers map their custom hardware as read-write instructions. For larger, more complex instructions, the insertion of instruction "busy cycles" becomes necessary and results in the processor pipeline being stalled. Ideally, the processor should be free to carry on with its other tasks rather than being held up by the instruction in progress.

A clever way to design these instructions is to make the processor push and pop calculations from the instruction pipeline, where the parameters and operations are registered, buffered, processed, and buffered again to be returned. Here, the designer reduces the logic depth of the hardware and increases performance by pipelining the instructions through. The controlling processor is freed to spend time executing other instructions and can, with careful mapping, be interrupted once the result is available.

In the case of the Toshiba MeP processor, this is achieved by using read-only and write-only instructions for the DSP instruction bus, thereby allowing calculations to be piped through at the fastest rate executable by the processor. Results are buffered and can be retrieved at the fastest rate possible. In this way, the MeP can execute batches of DSP and non-DSP instructions and pipeline them efficiently through its architecture (Fig 1).
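The write-only/read-only split described above can be illustrated with a small software model. The Python sketch below is purely illustrative and is not MeP code: the class name `BufferedDspUnit`, the two-stage latency, and the squaring operation are all assumptions chosen for the example. It only shows the idea that a write returns immediately, the hardware keeps clocking, and a read pops a buffered result once it has emerged from the pipeline.

```python
from collections import deque


class BufferedDspUnit:
    """Toy model of custom hardware behind a write-only/read-only instruction pair.

    Hypothetical names and behavior; not taken from the MeP documentation.
    """

    def __init__(self, latency=2, operation=lambda x: x * x):
        self.operation = operation
        # One slot per pipeline stage; None marks an empty stage (a bubble).
        self.stages = deque([None] * latency, maxlen=latency)
        # Output buffer holding results that have left the pipeline.
        self.results = deque()

    def _clock(self, incoming=None):
        # Advance the pipeline by one cycle: the value in the last stage
        # moves to the output buffer, and a new value enters the first stage.
        leaving = self.stages[-1]
        if leaving is not None:
            self.results.append(leaving)
        self.stages.appendleft(incoming)  # maxlen drops the old last stage

    def write(self, param):
        """Write-only instruction: register the parameter and return at once."""
        self._clock(self.operation(param))

    def read(self):
        """Read-only instruction: pop the oldest result, or None if not ready."""
        self._clock(None)
        return self.results.popleft() if self.results else None


unit = BufferedDspUnit(latency=2)
unit.write(3)
unit.write(4)
first = unit.read()   # 9: the result of the first write is now available
second = unit.read()  # 16
```

In this model a read issued immediately after a single write returns `None`, which mirrors the point that the two instructions must be separated by enough latency for the result to be buffered.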

1. A pipelined multi-cycle operation designed to return the result with the executed instruction: the ideal way to fully map the design to the instruction.

The processor pipelines all stages where the instruction does not raise the BUSY signal. Assuming that each DSP instruction performs its operations for 2 cycles, and that n is the number of pipelined DSP instructions, this results in a total of 4n+1 cycles. For example, running 10 DSP instructions results in 41 cycles of operation by the hardware.
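As a quick check of the arithmetic, the cycle count can be tabulated for a few values of n. This sketch simply evaluates the 4n+1 expression quoted above; it does not derive that expression from the MeP pipeline, and the function name is an assumption made for illustration.

```python
def pipelined_dsp_cycles(n: int) -> int:
    """Total cycles for n pipelined DSP instructions, using the 4n + 1
    figure quoted in the text (2-cycle DSP operation, no BUSY cycles)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 4 * n + 1


# Tabulate a few cases, including the 10-instruction example from the text.
for n in (1, 10, 100):
    print(f"{n:>3} instructions -> {pipelined_dsp_cycles(n)} cycles")
```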

2. Pipelined operations broken into two instructions – one to receive the parameters and one to return the result.

Without any BUSY cycles, the processor can pipeline the stages of the two instructions. In place of each single DSP instruction, two instructions are now executed, with latency between them. For the above 10 instructions, 10 parameters are sent to the processor and 10 results are received in 20 cycles. The main point here is that the number of instructions written must never exceed the buffer size, so as to avoid an overflow condition. A more optimal design combines the two instructions, relying on the fact that coherent data is received after a known number of DSP instruction writes.
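The overflow constraint above suggests issuing writes in batches no larger than the result buffer and draining the buffer before writing again. The sketch below models that policy under stated assumptions: each write and each read is counted as one cycle (consistent with the 20-cycle figure for 10 parameters and 10 results), pipeline latency is ignored, and `run_in_batches`, the buffer size of 4, and the doubling operation are hypothetical choices for illustration only.

```python
from collections import deque


def run_in_batches(params, buffer_size, operation):
    """Write parameters in batches that fit the buffer, then read them back.

    Returns (results, cycles), counting one cycle per write and per read.
    """
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    results = []
    cycles = 0
    buffer = deque()  # stands in for the hardware result buffer
    for start in range(0, len(params), buffer_size):
        batch = params[start:start + buffer_size]
        for p in batch:
            # Write-only instruction: the result is buffered in hardware.
            buffer.append(operation(p))
            assert len(buffer) <= buffer_size  # never overflow the buffer
            cycles += 1
        while buffer:
            # Read-only instruction: retrieve the oldest buffered result.
            results.append(buffer.popleft())
            cycles += 1
    return results, cycles


results, cycles = run_in_batches(list(range(10)), buffer_size=4,
                                 operation=lambda x: x * 2)
print(results, cycles)
```

Under these assumptions the 10 parameters complete in 20 cycles, compared with the 41 cycles quoted earlier for the single-instruction form.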

Processor-to-hardware partitioning is not a simple task, but as we have seen here, the proper tools combined with a little knowledge can make it far less daunting and, in fact, something of an enjoyable challenge.

(More information on the MeP is available from http://www.mepcore.com/; more information on the MeP Developer's Kit is available from www.celoxica.com/products/mep/default.asp.)

Hammad Hamid is a Senior Design Engineer with Celoxica. Hammad joined Celoxica in 1999 as a member of the company's early-phase engineering team. He has worked in software tools development and applications engineering roles, and was the lead engineer developing Celoxica's hardware/software co-design technology. With a worldwide remit, Hamid is currently concentrating on projects involving custom microprocessor development and integration. He graduated from the universities of Loughborough, UK and Hull, UK with a B.Eng. in Aeronautical Engineering and an M.Sc. in Computer Graphics and Virtual Environments. Hammad can be reached at Hammad.Hamid@celoxica.com.